NurseBridge (NB) wants to create a data driven predictive model allowing rural healthcare systems to make more informed decisions regarding compensation offerings at the most granular unit of time possible.
Goal is to use this data to categorize risk of death by day of the week on a county level.
We will start with monthly data and attempt to work into the day of the week. Due to anonymization of data, days of the week are specified but not the actual dates. I.e, All Fridays of the month are bundled into a single Friday variable.
Underlying cause of death 2018-21
5.Initial imports were performed on Texas and Oklahoma ranging from 2018-21 as these two locations were specifically mentioned by NB as some of their primary areas of focus.
adj_death values are calculated by substracting COVID
deaths from in-hospital deaths. crude_rate_10k values are
calculated by dividing adj_death values by
county_population values and multiplying by 10,000. This
gives a mortality rate per 10,000 people per county.
## Rows: 127,104
## Columns: 9
## $ year <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2…
## $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ wkday <fct> Sunday, Monday, Tuesday, Wednesday, Thursday, Fri…
## $ state <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, O…
## $ county <fct> Adair, Adair, Adair, Adair, Adair, Adair, Adair, …
## $ county_code <dbl> 40001, 40001, 40001, 40001, 40001, 40001, 40001, …
## $ county_population <dbl> 22082, 22082, 22082, 22082, 22082, 22082, 22082, …
## $ wkday_adj_deaths <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ wkday_crude_rate_10k <dbl> 0.4528575, 0.4528575, 0.0000000, 0.4528575, 0.000…
## Rows: 15,888
## Columns: 8
## $ year <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3,…
## $ date <date> 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01…
## $ state <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK,…
## $ county <fct> Adair, Adair, Adair, Adair, Adair, Adair, Adair…
## $ county_code <dbl> 40001, 40001, 40001, 40001, 40001, 40001, 40001…
## $ monthly_adj_deaths <dbl> 16, 20, 15, 14, 1, 11, 10, 17, 1, 11, 1, 17, 1,…
## $ monthly_crude_rate_10k <dbl> 7.2457205, 9.0571506, 6.7928630, 6.3400054, 0.4…
## Rows: 1,324
## Columns: 7
## $ year <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ state <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, …
## $ county <fct> Adair, Alfalfa, Atoka, Beaver, Beckham, Blaine, …
## $ county_code <dbl> 40001, 40003, 40005, 40007, 40009, 40011, 40013,…
## $ county_population <dbl> 22082, 5754, 13838, 5319, 21709, 9485, 47192, 28…
## $ annual_adj_deaths <dbl> 156, 39, 88, 38, 145, 79, 322, 258, 707, 349, 31…
## $ annual_crude_rate_10k <dbl> 70.64577, 67.77894, 63.59300, 71.44200, 66.79257…
Early on in EDA it became evident that some replaced
Suppressed values were skewing data in counties with low
populations. For example, Loving, TX had just a population of 57, but
with 5 deaths the crude rate is at a dramatic 877 deaths per 10k
people.
df_annual_5 %>%
select(state, county, county_population, annual_adj_deaths, annual_crude_rate_10k) %>%
arrange(-annual_crude_rate_10k) %>%
slice(1:100)When Suppressed values are replaced with 5, we also see it’s possible
to have negative death rates. For example, if there are 5 months in a
year with Suppressed COVID hospital deaths, they are all converted to
the value 5. The summarised value would then be 25 COVID hospital
deaths. The non-suppressed data shows, however, that the total
in-hospital deaths over that same year is 10. Since our adjusted death
rate subtracts in hospital deaths from COVID deaths we end up with
10 - 25 = -15 in-hospital, non-COVID deaths. Logically,
this doesn’t make sense and we could assume the true suppressed values
are between 1 and 2 for each fo those months.
This histogram below illustrates the issue with using a value of 5 for Suppressed values, resulting in negative value as well as dramatically high rates for low population counties.
If we replace values with 1 instead of 5 we see the distribution follows a more normal curve as would be expected. However, we do see that some rates appear to still fall into the negative ranges. These specific examples are again artifacts of our imperfect replacement with 1.
ggplotly(
df_annual_neg %>%
ggplot(aes(x=annual_crude_rate_10k))+
geom_histogram(bins=50, fill=NA, color="black")+
geom_vline(xintercept = 0, color="red")+
labs(
title="Better distribution of annual crude death rates"
)
)Because we don’t want to remove any counties from representation we
will simply add 3 to the annual_hospital_deaths values so
that the rate remains positive. An ideal solution may have been to
perform a more advanced imputation with a glm, but this gets us in the
right direction.
The rows in question with negative values.
Modifying said rows and displaying the final adjusted calculated variables:
ggplotly(
df_annual %>%
ggplot(aes(x=annual_crude_rate_10k))+
geom_histogram(bins=50, fill=NA, color="black")+
# geom_vline(xintercept = 0, color="red")+
labs(
title="Final distribution of annual crude death rates"
)
)We can also see the distribution remains relatively similar over the years for each state.
Initial attempts at visualizing crude death rates over time
We can show that over time the death rates have migrated from one county to another, and that this rate is not correlated to county population.
Joining the map data with crude death rate per 10k people per county, and then binning those values into 3 equal groups, and plotting them on maps.
Starting with Oklahoma
# Annual map data ----
## Oklahoma ----
# join crude rates with map data
df_ok_2021 <- ok_counties %>%
left_join(
df_annual %>%
filter(year == '2021' & state == 'OK') %>%
select(county, county_code, annual_crude_rate_10k) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "rate"))
)
df_ok_2020 <- ok_counties %>%
left_join(
df_annual %>%
filter(year == '2020' & state == 'OK') %>%
select(county, county_code, annual_crude_rate_10k) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "rate"))
)
df_ok_2019 <- ok_counties %>%
left_join(
df_annual %>%
filter(year == '2019' & state == 'OK') %>%
select(county, county_code, annual_crude_rate_10k) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "rate"))
)
df_ok_2018 <- ok_counties %>%
left_join(
df_annual %>%
filter(year == '2018' & state == 'OK') %>%
select(county, county_code, annual_crude_rate_10k) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "rate"))
)
# divide into equal groups by rate
df_ok_2021$group <- as.factor(cut_number(df_ok_2021$rate, 3))
df_ok_2020$group <- as.factor(cut_number(df_ok_2020$rate, 3))
df_ok_2019$group <- as.factor(cut_number(df_ok_2019$rate, 3))
df_ok_2018$group <- as.factor(cut_number(df_ok_2018$rate, 3))
# make plots
gg_ok_2021 <- df_ok_2021 %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2021')
gg_ok_2020 <- df_ok_2020 %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2020')
gg_ok_2019 <- df_ok_2019 %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2019')
gg_ok_2018 <- df_ok_2018 %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2018')
# populations
df_ok_2021_pop <- ok_counties %>%
left_join(
df_annual %>%
filter(year == '2021' & state == 'OK') %>%
select(county, county_code, county_population) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "population"))
)
df_ok_2021_pop$group <- as.factor(cut_number(df_ok_2021_pop$population/1000, 3))
gg_ok_2021_pop <- df_ok_2021_pop %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2021 Population')It’s actually quite clear that more rural areas in texas are suffering from a higher crude death rate.
# join crude rates with map data
df_tx_2021 <- tx_counties %>%
left_join(
df_annual %>%
filter(year == '2021' & state == 'TX') %>%
select(county, county_code, annual_crude_rate_10k) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "rate"))
)
df_tx_2020 <- tx_counties %>%
left_join(
df_annual %>%
filter(year == '2020' & state == 'TX') %>%
select(county, county_code, annual_crude_rate_10k) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "rate"))
)
df_tx_2019 <- tx_counties %>%
left_join(
df_annual %>%
filter(year == '2019' & state == 'TX') %>%
select(county, county_code, annual_crude_rate_10k) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "rate"))
)
df_tx_2018 <- tx_counties %>%
left_join(
df_annual %>%
filter(year == '2018' & state == 'TX') %>%
select(county, county_code, annual_crude_rate_10k) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "rate"))
)
# divide into equal groups by rate
df_tx_2021$group <- as.factor(cut_number(df_tx_2021$rate, 3))
df_tx_2020$group <- as.factor(cut_number(df_tx_2020$rate, 3))
df_tx_2019$group <- as.factor(cut_number(df_tx_2019$rate, 3))
df_tx_2018$group <- as.factor(cut_number(df_tx_2018$rate, 3))
# make plots
gg_tx_2021 <- df_tx_2021 %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2021')
gg_tx_2020 <- df_tx_2020 %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2020')
gg_tx_2019 <- df_tx_2019 %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2019')
gg_tx_2018 <- df_tx_2018 %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2018')
df_tx_2021_pop <- tx_counties %>%
left_join(
df_annual %>%
filter(year == '2021' & state == 'TX') %>%
select(county, county_code, county_population) %>%
mutate(county_code = as.character(county_code),
county = as.character(county)) %>%
set_colnames(c("NAME", "GEOID", "population"))
)
df_tx_2021_pop$group <- as.factor(cut_number(df_tx_2021_pop$population, 3, labels=FALSE))
gg_tx_2021_pop <- df_tx_2021_pop %>%
ggplot(aes(fill=group))+
geom_sf() +
theme_void() +
scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
ggtitle('2021 Population')